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Abstract 

The wide applicability of kernels makes the problem of 
max-kernel search ubiquitous and more general than the 
usual similarity search in metric spaces. We focus on solv- 
ing this problem efficiently. We begin by characterizing 
the inherent hardness of the max-kernel search problem 
with a novel notion of directional concentration. Follow- 
ing that, we present a method to use an 0(?ilogn) algo- 
rithm to index any set of objects (points in MP or abstract 
objects) directly in the Hilbert space without any explicit 
feature representations of the objects in this space. We 
present the first provably O(logn) algorithm for exact 
max-kernel search using this index. Empirical results for 
a variety of data sets as well as abstract objects demon- 
strate up to 4 orders of magnitude speedup in some cases. 
Extensions for approximate max-kernel search are also 
presented. 



1 Introduction 

MfLX-kernel search. In this paper, we present a novel 
algorithm to accelerate the following general task of max- 
kernel search (MKS): for a given set 5 of n objects (the 
reference set), a Mercer kernel function /C(-,-), and a 
query q, find the object p G S such that: 



(1) 



p — argmax/C((7, r). 



This general form of problem is ubiquitous in computer 
science; it can be easily seen as a similarity search (for 
some similarity function /C(-,-)). The most simple ap- 
proach to this general problem is a linear scan over all 
the objects in S. However, the computational cost of this 
approach scales linearly with the size of the reference set 
(151 = n) for a single query, making it computationally 
prohibitive for large data sets. 

A Mercer kernel can define a measure of similarity for 
classes of objects including points in R®, and extending 
to objects which do not have natural fixed-length repre- 
sentations. Kernels are ubiquitous and can be devised 
for any new class of objects, such as images and docu- 
ments (which can be considered as points in M®), to more 
abstract objects like strings (protein sequences [1], text), 
graphs (molecules [2j, brain neuron activation paths), and 
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Figure 1: Matching images - an example of MKS with 
the linear, cosine, Gaussian and polynomial kernels as 
the usual similarity functions. 

time series (music, financial data) |3]. The beauty of ker- 
nels is the renowned "kernel trick" - the ability to eval- 
uate similarity (inner products) between any pair of ob- 
jects in some feature space without requiring the explicit 
feature representations of those objects. 

A special case is the problem of nearest-neighbor search 
(NNS) in metric spaces, in which the closest object to the 
query with respect to a distance metric is sought. How- 
ever, the requirement of a distance metric makes many 
efficient methods for exact and approximate NNS |4j un- 
applicable to the general MKS. 

Some large-scale applications. The widely studied 
problem of image matching in computer vision is a MKS. 
The sets of images are of the order of millions and still 
growing, making linear scan computationally prohibitive. 
MKS also appears in maximum-a-posteriori inference [5] , 
as well as collaborative filtering (via the widely successful 
matrix factorization framework) [6]. This matrix factor- 
ization obtains an accurate representation of the data in 
terms of user vectors and item vectors, and the desired 
result - the user preference of an item - is the inner- 
product between the two respective vectors (a Mercer ker- 
nel). With ever-scaling item sets and millions of users [3, 
reducing the cost of retrieval of recommendations (which 
is a MKS) is critical for real-world systems. Finding sim- 
ilar protein/DNA sequences for a query sequence from 
a large set of biological sequences is also an instance of 
MKS with biological sequences as the objects and vari- 
ous domain-specific kernels (for example, the p-spectrum 
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kernel [T], maximal segment match score [5] &: Smith- 
Waterman alignment score foQ)- An efhcient MKS algo- 
rithm can be directly applied to biological (sub)sequence 
matching. 

Contributions. To address the breadth of applications 
of the MKS problem, we present a method to accelerate 
max-kernel search for any class of objects with a corre- 
sponding Mercer kernel. Our contributions are: 

• The first concept for characterizing the hardness of 
MKS in terms of the concentration of the kernel values 
in any direction - the directional concentration. 

• An 0{n log n) algorithm to index any set of objects di- 
rectly in the Hilbert space defined by the kernel without 
requiring explicit representations of the objects in this 
space. 

• A novel branch-and-bound algorithm on the index in 
the Hilbert space, which can achieve orders of magni- 
tude speedups over linear search. 

• Value- approximate and order-approximate extensions 
to the exact max-kernel search algorithm. 

• The first O(logn) runtime bound for exact MKS with 
our proposed algorithm for any Mercer kernel. 

1.1 Related work 

Although there are existing techniques for MKS, almost 
all of them solve the approximate problem under restricted 
settings. The most common assumption is that the ob- 
jects are points in some metric space and the kernel func- 
tion is shift-invariant - a monotonic function of the dis- 
tance between the two objects {JC{p,q) = f{\\p — q\\)), 
such as the Gaussian radial basis function (RBF) kernel. 
For shift-invariant kernels, a tree-based recursive algo- 
rithm has been shown to scale to large sets for maximum- 
a-posteriori inference [S]. However, a shift-invariant ker- 
nel is only applicable to objects already represented in 
some metric space. In fact, the MKS with a shift- 
invariant kernel is equivalent to NNS in the input space 
itself, and can be solved directly using existing methods 
for NNS, an easier and better-studied problem. For shift- 
invariant kernels, the points can be explicitly embedded 
in some low-dimensional metric space such that the inner 
product between the representations of any two points 
approximates their corresponding kernel value |10) . This 
step takes 0{n'D^) time for S C and can be followed 
by NNS on these representations to accomplish MKS. 

Locality-sensitive hashing (LSH) [TT] is widely used for 
image matching, but only with explicitly representable 
kernel functions which admit a locality sensitive hash- 



^The last two functions provide matching scores for pairs of se- 
quences and can easily be shown to be Mercer kernels. 



ing function P^pl . Kulis and Grauman [T3] apply LSH 
to solve MKS approximately for normalized kernels with- 
out any explicit representation. Normalized kernels re- 
strict the self-similarity value to a constant {K.{x,x) = 
/C(y, y) V a;, y e S). The preprocessing time for this LSH 
is 0{p^) and a single query requires 0{p) kernel evalu- 
ations. Here p controls the accuracy of the algorithm - 
larger p implies better approximation; the suggest value 
for p is 0{^/n) with no rigorous approximation guaran- 
tees. 

A recent work [14] proposed the first technique for 
exact MKS using a tree-based branch-and-bound algo- 
rithm but is restricted only to linear kernels and the al- 
gorithm does not have any runtime guarantees. The au- 
thors suggest a method for extending their algorithm to 
non-representable spaces with general Mercer kernels but 
require O(n^) preprocessing time. 

A case for un-normalized kernels. While some of 
the kernels used in machine learning (for example, the 
Gaussian and the cosine kernel) are normalized, some 
common kernels like the linear kernel (and the polyno- 
mial kernels) are not normalized. Many applications re- 
quire un-normalized kernels: (1) In the retrieval of rec- 
ommendations, the normalized linear kernel will result 
in inaccurate user-item preference scores. (2) In biologi- 
cal sequence matching with the domain-specific matching 
functions, lC{x, x) implicitly corresponds to presence of 
(genetically) valuable letters (like W, H, P) or invaluable 
letters (like xjl in the sequence x. This crucial infor- 
mation is lost in kernel normalization. In this paper, we 
consider general un-normalized kernels as well as the spe- 
cial case of normalized kernels. 

None of the existing techniques can be directly applied 
to every instance of MKS with general Mercer kernels 
and any class of objects (Equation [1]). Moreover, almost 
all existing techniques resort to approximate solutions. 
In this paper, we present a method to perform the exact 
max-kernel search with 0{n log n) preprocessing time and 
O(logn) query time, as well as approximate methods with 
rigorous accuracy guarantees. 

This paper. In the following section, we contrast MKS 
to NNS, investigate the inherent hardness of MKS and 
motivate the applicability of indexing schemes used for 
NNS in MKS. Section [3] discusses the construction of a 
tree in the kernel space without explicit representations 
of objects. SectionUpresents the novel branch-and-bound 
algorithm for exact MKS and extends it to approximate 
MKS. Sections[5]and|n]examine the theoretical and empir- 
ical performance of the approach, respectively. Section [7] 



^The Gaussian and the cosine kernel admit locality sensitive 
hashing functions with some modifications. 

■^See the score matrix for letter pairs in protein sequences at 
http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt. 
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extends the proposed method to the distributed setting 
and we conclude with Section [H 



2 Max-kernel search 

A Mercer kernel implies that the kernel value for a pair 
of objects {x,y) corresponds to an inner-product between 
the vector representation of the objects ip{x), (p{y) in some 
inner-product space % {lC{x,y) — {ip{x) , ^{y)) u) ■ Hence 
every Mercer kernel induces the following metric in %: 



(2) 



dic{x,y) = ||<^(a::) - (p{y)\\ 

= ^/JC{x,x) +IC{y,y) - 21C{x,y). 



Whenever MKS can be reduced to NNS in H, NNS meth- 
ods for general metrics [4] can be used for efficient MKS. 
In this section, we show that this reduction is possible 
only under strict conditions. 

Subsequently, we discuss the hardness of MKS and con- 
trast it to the hardness of NNS. Finally, we discuss de- 
sirable properties of certain NNS techniques and their 
applicability to MKS. 

Possible reductions and conditions. The nearest 
neighbor for a query q in H (argmin^gs cij<;((j', r)) is the 
max- kernel candidate (Eq. ([T])) if /C(-, •) is a normalized 
kernel. The two problems can have very different answers 
with un-normalized kernels. Every kernel also induces a 
cosine similarity (^JC{x, y)/\/ IC{x, x)IC{y, y)^ in H. Sim- 
ilar to the previous case, the object with the maximum 
cosine-similarity in H is the max-kernel candidate only 
for normalized kernels. 

Hardness of nearest-neighbor search. However, even 
if MKS can be reduced to NNS, NNS is still hard for 
general metrics (J7(n) for a single query) without any as- 
sumption on the structure of the metric and the set (S', d). 
Here we present one notion of characterizing the hardness 
in terms of the structure of the metric |15) : 

Definition 2.1. Let Bs{p,A) ^ {r e S: d{p,r) < A} 
denote a closed ball of radius A around some p d S with 
respect to a metric d. Then, the expansion constant of 

{S,d) is defined as the smallest c > 2 such \Bs{p,2A)\ < 
c \Bs{p, A)| Vp G S' and VA > 0. 

The expansion constant effectively bounds the num- 
ber of points that could be sitting on the surface of a 
hyper-sphere of any radius around any point. If c is high, 
NNS could be expensive. A value of c ^ Vl{n) implies 
that the search cannot be better than linear scan asymp- 
totically. Under the assumption of bounded expansion 
constant, NNS methods with sub- linear/logarithmic the- 
oretical runtime guarantees have been presented [HI [16] . 



2.1 Characterizing the hardness of MKS 

The kernel values for a query are proportional to the 
length of the projections in the direction of the query in 
T-L. While the hardness of NNS depends on how concen- 
trated the surface of spheres are (as characterized by the 
expansion constant), the hardness of MKS should depend 
on the distribution of the projections in the direction of 
the query. This distribution can be characterized using 
the distribution of points in terms of distances. // two 
points are close in distance, then their projections in any 
direction are close as well. However, if two points have 
close projections in a direction, it is not necessary that the 
points themselves are close to each other. It is to charac- 
terize this reverse relationship between points (closeness 
in projections to closeness in distances) that we present 
a new notion of concentration in any direction: 

Definition 2.2. Let IC{x,y) — {(p{x), (p(y))-H be a Mer- 
cer kernel, d]c{x,y) be the induced metric from K, and 
Bs{p,A) denote the corresponding closed ball of radius 
A around a point p in H. Let Is{v,[a,b]) = {r G 
S: {v,if{r))'n G [a,^]} be the set of objects in S pro- 
jected onto a direction v in % lying in the interval [a, b] 
along v. Then, the directional concentration con- 
stant of {S, K) is defined as the smallest 7 > 1 such that 
Vu G "H such that \\u\\^ = 1, \fp G S and VA > 0, the set 
Is{u, [{u, ip[p))n - A, (u, (pip))H + A]) can be covered by 
at most 7 balls of radius A. 

The directional concentration constant is not a measure 
of the number of points projected into a small interval, 
but rather a measure of the number of "patches" of the 
data in a particular direction. For a set of points to be 
close in projections, there are at most 7 subsets of points 
which are close in distances as well. Consider the set 
of points B = Is{q,[a,b]) projected into an interval in 
some direction (Figure 2(a)). A high value of 7 implies 
that the number of points in B is high but the points 
are spread out and the number of balls (with diameter 
equal to |6 — a|) required to cover all these points is high 
as well (each point possibly requiring an individual ball) . 
Figure 2(c) provides one such example. A low value of 



7 implies that either B has a small size or the size of 
B is high and B can be covered with a small number of 
balls (of diameter \b — a\). Figure 2(b) is an example of 



a set with low 7. The directional concentration constant 
bounds the number of balls of a particular radius required 
to index points that project onto an interval of size twice 
the radius. 

2.2 Desirable existing techniques 

The notion of bounded directional concentration im- 
plies that indexing schemes capable of efficiently index- 
ing points in a particular direction might be useful for 
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Figure 2: Concentration of projections. 



the task of MKS. Indexing schemes hke space- partitioning 
trees have been widely used for NNS with success in many 
cases. Trees offer good hierarchical indexing schemes in 
low to medium dimensions; for high-dimensional data, 
trees which exploit some low-dimensional structure have 
been developed [T71 [TB]. These hierarchical indexing 
schemes lend easily to intuitive branch-and-bound algo- 
rithms for efficient solutions (especially approximate so- 
lutions [T51 [in])- In addition, tree-based branch-and- 
bound algorithms are essentially incremental algorithms, 
and thus easily extend to anytime algorithms ^U\. Most 
importantly, trees only require a single construction - dif- 
ferent algorithms can work on the same tree; in addition, 
once a tree is built, point insertions and deletions are 
generally cheap. 

Which tree to use for MKS? Given the numerous ad- 
vantages of trees, we wish to select an appropriate tree for 
efficient MKS. The following two properties are desired of 
any indexing scheme used for MKS: 

• Explicit representation of the objects is not required. 

• It should have sub-quadratic construction time. 

The fcd-tree [H] and the metric tree [22] exhibit good 
characteristics and are widely used in NNS. However, the 
kd-tree requires the explicit representation of the points 
(f{x) in y, for its axis-aligned splits. For similar reasons, 
random-projection trees [17] and PCA-trees [23] cannot 
be used for MKS. Metric trees can be constructed without 
the explicit representations since they can work with only 
the ability to evaluate the induced metric djc [2ipl . In this 
paper, we consider the recently formulated cover tree [TB] 
for MKS. It provides a systematic way to build a tree 
without explicit object representations and with a sub- 
quadratic construction time. A key difference between 
cover trees and the aforementioned trees is that the kd- 
tree and the metric tree are binary trees while the cover 
trees can have multiple number of children. Moreover, 

Calculation of the explicit mean ^ of a node is avoid- 
able by using the kernel trick for operations on the mean: 
((/i, ifi{q)) = X^^gy !C{x, ci')/|r|) (where T is the set of points in the 
node). However, this makes the tree construction and tree search 
computational prohibitive for efficient MKS. 



the time complexities of building and querying a cover 
trees have been analyzed extensively TBI [25] , whereas kd- 
trees and metric trees have been analyzed only under very 
limited settings |21j . 

3 Indexing in kernel space 
with cover trees 

A cover tree stores a dataset S of size n in the form of a 
levelled tree. The structure has 0{n) space requirement 
(Theorem 1 in [TB]). Each level is a "cover" for the level 
beneath it and is indexed by an integer scale i which 
decreases as the tree is descended. Let Ci denote the set 
of nodes at scale i. For all scales i, the following invariants 
hold: (i) (nesting invariant) d C Ci_i, (ii) (covering tree 
invariant) Vp S Ci-i,3q G Ci: d{p,q) < 2\ and exactly 
one such q is a parent of p. (iii) (separation invariant) 
For all p,q e Ci, d{p, q) > 2\ 

Although the cover tree is defined as infinite levelled 
sets, the tree has a precise finite representation. As ex- 
plained in |25j . "The implicit representation consists of 
infinitely many levels Ci with the level Coc containing a 
single node which is the root and the level C_oo contain- 
ing every point in the dataset as a node. The explicit 
representation is required to store the tree in 0{n) space. 
It coalesces all nodes in the tree for which the only child 
is the self-child. This implies that every explicit node ei- 
ther has a parent other than the self-parent or has a child 
other than a self-child." 

Tree construction in the kernel space H. Every 
node in a cover tree is associated with a single point p. 
Hence, the cover tree construction only requires distances 
between the points in the set (the tree construction algo- 
rithm is presented in Section A of the supplement); and 
therefore construction in H does not require any explicit 
representation in %. From Equation [21 the pairwise dis- 
tance dic{x,y) can be evaluated with only three kernel 
evaluations. If the self-kernel values JC{x,x) V x G S are 
precomputed and saved, dic{x,y) only requires one eval- 
uation of /C(x,?/). The tree construction time is bounded 
by the following theorem: 
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Theorem 3.1. (Theorem 3.6 in f^) For a data set S 
of n objects and a metric djc with expansion constant c, 
the tree construction requires at most 0{c^n\ogn) time. 

Remark. We would like to point out that while we con- 
sider the cover tree to implicitly index points in % with- 
out any modification, the cover tree has only been used 
for NNS to this date. The novelty of this paper is a new 
branch-and-bound algorithm using this tree (Section!?]) to 
solve the general task of MKS with provable theoretical 
guarantees (Section [S]) and supporting empirical evidence 
(Section ID). 

4 Simple branch-and-bound 
algorithm 

In this section, we present a simple branch-and-bound al- 
gorithm on the cover tree for MKS. Branch-and-bound is 
widely used in NNS with the help of the triangle inequal- 
ity of the distance metric. In the absence of the triangle 
inequality for kernel functions, we obtain a novel bound 
on the max-kernel value possible between a query and any 
subtree of a cover tree. Then we present the algorithm 
for exact and approximate search. 

Bounding the max-kernel value. A cover tree node 
is defined by an object p and a level i. Let S'p denote the 
set of objects in the subtree rooted at a node defined by 
object p at level i. In the following theorem, we bound 
the maximum possible kernel value between a query and 
an object in the subtree of the cover tree. For notational 
convenience, we will denote max^e/j AI!(q, r) as )C{q,R). 

Theorem 4.1. Given a cover tree node rooted at an ob- 
ject p at level i in the kernel space % and a (query) object 
q, the maximum kernel function value between q and any 
object in the set is bounded as: 



(3) 



JC{q,S;)<IC{q,p) 



Proof. Suppose that p* is the best possible match in the 
set Sp for q and let m be a unit vector in the direction 
of the line joining ip{p) to ^p{p*) in %. Then ^p{p*) = 
(/?(p) -I- A • u where A — dic{p,p*) is the distance ip{p) and 
the best possible match fip*) (see Figure [3]). Then we 
have the following: 



(4) 



JC{q,S;) = K{q,p*)^{^{q),^{p*))u 

< (^(9),'^(P)>« + A||^(g)||„, 



bound A from above using the covering invariant - for 
any cover tree node p at level i, the distance to the far- 
thest child node is bounded by 2*. Applying this bound 
recursively with the triangle inequality of dx: gives us 
A = dK.{p,p*) < Ej=-oo2^ = 2*+^ The statement of 
the theorem follows. □ 



t¥'(q) 








/ A • u\ /2'+^ \ 







o 

Figure 3: Max-kernel upper bound. 

For normalized kernels {K.{x,x) = IV x e S), all the 
points are on the surface of a hyper-sphere in %. In this 
case, the above bound in Theorem 14. II is correct but pos- 
sibly loose. In the following theorem, we present a tighter 
bound specifically for this condition: 

Theorem 4.2. Consider a kernel K such that K{x,x) = 
IV X e S*. Given a cover tree node rooted at an object p at 
level i in H and a ( query) object q, the maximum kernel 
function value between q and any object in the set Sp is 
bounded as: 



r/c(<z,p)(i-2^'+^) 



+ 2^+1 v/(l-/C(q,p)2)(l-22') 
»//C(g,p)<l-22'+i 
^1.0, otherwise 



(5) ic{q,s;)< 



The proof is similar to the proof of Theorem 14.11 and 
presented in Section B of the supplement. 

Algorithm 1 FastMKS fTree T, query q) 

Initialize i?oo = Coo // the root of the tree 
for « = oo to — oo do 

R — {Children{r): r G Ri\ // tree descend 
R,-i = {reR:ICiq,r)>K{q,R) ~ 
end for 

return argmax^gfl 



2*v/^q:9)} 

K.{q,r) // the leaf nodes. 



where the last inequality follows from the Cauchy- 
Schwartz inequality {{x,y) < \\x\\ \\y\\) and the fact that 
llwll = 1. From the definition of the kernel function, 
Equation m gives us IC{q, Sp 



The branch-and-bound algorithm. The bound on 
JC{q, Sp) is used to decide whether a node is retained for 
further exploration or removed ( "pruned" ) from consider- 
ation. The branch-and-bound algorithm FastMKS(T, q) 
is presented in Alg. [1] If the maximum possible kernel 
value from a subtree is less than the current best ker- 
< /C(g,p) + A^/C((7, q). We nel value, that subtree is not explored any further. The 
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Figure 4: Branch-and-bound tree-traversal - the green nodes are retained and the red nodes are pruned. 



best possible kernel value from a subtree and a step of 
the algorithm is depicted in Figure SI For a retained 
node, all its children are explored. The correctness of 
FastMKS(r,g) follows from: 

Theorem 4.3. // T is a cover tree on S, then 
FastMKS{T,q) (Alg. Qp returns argmax^gs ^^((j, r). 

We present a sketch proof. The complete proof is pre- 
sented in Section C of the supplement. 

Pro of, (ske tch) By TheoremlHl /C(g,i?,_i) > lC{q,S) - 
2'JK.{q,q) V i. Therefore, lim /C(q,i?,_i) = IC{q,S). 

i— f — oo 

Hence argmaxrefl_oo ^(9,^) = argmaxre5/C(q, r). □ 

Approximate max-kernel search. Similarity search 
problems can be approximated for further scalability. 
Even though we are focusing on exact max-kernel search 
in this paper, we wish to demonstrate that the tree based 
method can be very easily extended to perform the ap- 
proximate max-kernel search. Approximation can be 
achieved in the following ways: 

1. Absolute value approximation (AVA): return p £ S 
such that JC{q,p) > lC{q, S) — e. 

2. Relative value approximation (RVA): return p Cz S 
such that IC{q,p) > IC{q, S) - e\IC{q, S)\. 

3. Rank approximation (RA): return p Cz S such that 
\{reS:JC{q,r)>JC{q,p)}\<T. 

Care has to be taken for relative value approximation 
since there is no guarantee that JC{q,S) > 0. The follow- 
ing stopping rules can be used at the end of an iteration 
in the for loop in FastMKS(r, q) (Alg. P for the value 
approximations. The best candidate up until then is the 
approximate solution. 

1. AVA stopping: if e > 2'^/IC{q,q), stop. 

2. RVA stopping: assuming JC{q,S) > 0, if /C(g,i?i_i) > 
(2Ve) V^MI, stop! 

Theorem 4.4. The AVA stopping rule returns a point 
p Cz S such that IC{q,p) > IC{q, S) — e. 
Proof. At the end of iteration at level i, /C(g, > 
JC{q, S) - 2V/C((7,q). If 2'^/IC{q,q) < e, the AVA condi- 
tion is satisfied. □ 

^The stopping rule can be easily modified for JC{q, S) < 0. 



Theorem 4.5. Assuming IC{q, S) > 0, the RVA stop- 
ping rule returns a point p £ S such that )C{q,p) > 
(l-e)/C(g,5). 

Proof When /C(g,i?,_i) > (2Ve) y^JC{q,q), then 
2'y/IC{q,q) < e/C((?,i?j_i) < eK{q,S). Using this, we 
have IC{q,R,-i) > /C(q,S') - 2'^lC{q,q) > /C(q,S') - 
eJCiq.S). □ 

Rank-approximation can be achieved by performing strat- 
ified sampling on the cover tree [TH]. The technique is 
more involved and presented in Section D of the supple- 
ment for the lack of space. 

5 Runtime analysis 

The runtime analysis of FastMKS will make use of the 
following results fllf : 

Lemma 5.1. The number of children of any node p is 
bounded by . 

Lemma 5.2. The maximum depth of any point p in the 
explicit representation is 0{c^ logn). 

The main result of this section is the search time com- 
plexity of FastMKS (T, q) in terms of the number of ob- 
jects in S and the properties of {S, K): 

Theorem 5.1. Given a Mercer kernel K,, if the dataset 
S of size n has an expansion constant c (with the 
metric d/c ) and a directional concentration constant 7, 
FastMKS{T,q) requires 0{c^^^^ logn) time. 
Proof. The first part of the proof is similar to the run- 
time analysis of NNS with cover trees [I6j. Let R* de- 
note the last explicit Ri considered by the algorithm. By 
Lemma 15. 2| the explicit depth of any point in R* is at 
most k — 0{c^ logn). The maximum number of iter- 
ations required would be at most fc|i?*| < fcmaxj|i?i|. 
The amount of work done in each iteration is at most 
0(max,i \Ri\), hence resulting in a total of 0(fc max^ 
work. Moreover, from Lemma 15.11 the total number of 
children encountered throughout the whole algorithm is 
at most kma,Xi\Ri\ ■ c*. Hence, the pruning/retaining 
rule does at most 0(/c max^ \Ri\ ■ c*) work. Also, R-00 < 
maxi \Ri\. Therefore, the algorithm requires at most 
0(fcmaxi + fcmaxi \Ri\ ■ c'') time. 
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Now we bound maxi \Ri\. Let u = (p{q)/ \\ip{q)\\. Then 
Is{viq),[a,b]) - Is{u,[a/\\<fiq)\\,b/\\ip{q)\\]). For any 
level i, let R = {Children(r) : r G and let k = JC{q, R) 
and K* = K,{q, S). Then 

= {r^R:JC{q,r)>K^2'\\^{q)\\} 

= Is{v{q)A^-'^'\Wm,^*])^R 
C /5(^((7),[«;-2'||^(g)||,«;*])na_i 

since k* < k + 2* ||</3((j)||. 

Then for any r G Is{v{q), - 2'+^ Mq)\\ , '^1), 

/5(^(g),[«*- 2^+^1^(9)11,^1) 

C Is{q, [K.{q,r) - Mq)\\ ,IC{q,r) + Mq)\\]) 

By the definition of the directional concentration con- 
stant, there exists rjs such that: 

Bounding the size of Is (g, [K - 2* \\ip{q)\\ , IC{q, S)])n d-i 
amounts to bounding the number of disjoint balls of ra- 
dius 2'"^ that can be packed into each of the 7 balls 
Bs{rj, 2*+^ -I- 2'~^). For each of the rj, we have: 

\Bsirj,T+^ + T-^)\ < \Bs{r',T+^ + T~^)\ 
< \Bs{r',T+')\<c''\Bsir,T-^)\. 

Hence, V i, \Ri\ < 70^. Hence max^ \Ri\ < jc^, thus 
giving us the statement of the theorem. □ 

Comparing to the query time 0{c^^ logn) for NNS |16| . 
it is clear that FastMKS has similar logn scaling, but 
also has an extra price of 7^ for solving the more general 
problem of max-kernel search. 

6 Evaluation 

We evaluate the FastMKS algorithm with different ker- 
nels and datasets. For every experiment, we query the 
top {1,2,5,10} max-kernel candidates and report the 
speedup over linear search (in terms of the number of 
kernel evaluations performed). The cover tree and the 
algorithm is developed in MLPACK [57j with the imple- 
mentation details in Section E of the supplement!! 

Datasets. We use two different classes of datasets - 
(1) Datasets with fixed-length objects: These include 
the MNIST dataset [33], the Isomap "Images" dataset, 

®See http://www.mlpack.org/ for more information on ML- 
PACK. 



several datasets from the UCI machine learning repos- 
itory [29], three collaborative filtering datasets (Movie- 
Lens, Netfiix [3D], Yahoo! Music [7]), the LCDM astron- 
omy dataset |3T] , the Live Journal blog moods text dataset 
[52] and a subset of the 80 million tiny images dataset [33] 
(the sizes of the datasets are presented in Table [5]). (2) 
Datasets without fixed length representation: We con- 
sider protein sequences from the GenBant0. 



Datasets 




\s\ 


dims 


Y! Music 


100000 


600000 


50 


MovieLens 


6040 


3706 


50 


Opt-digits 


450 


1347 


64 


Physics 


37500 


112500 


78 


Bio 


75000 


210409 


74 


Covertype 


150000 


431012 


55 


LiveJournal 


10000 


10000 


25327 


MNIST 


10000 


60000 


784 


Netfiix 


480189 


17770 


50 


Corel 


10000 


27749 


32 


LCDM 


6000000 


10777216 


3 


Tinylmages 


5000 


1000000 


384 



Table 1: Details of the vector datasets. 



Kernels. We consider all of the following kernels for 
the vector datasets: cosine, hyperbolic tangenl^ linear, 
and polynomial (with degree 10). While FastMKS is 
applicable to any kernel, MKS with the Gaussian kernel 
reduces to NNS in the input space; hence, we omit this 
kernel from our experiments. The p-spectrum kernel [1] 
is used for the protein sequence data. 
Results. The results for the vector datasets are summa- 
rized in Figure [5] While the speedups range from any- 
where between 1 to over 10^, a speedup close to an or- 
der of magnitude is seen in most large datasets (except 
MNIST). It is also quite clear that different kernels give 
very different speedups for the same dataset. This can be 
attributed to the fact that the expansion constant and the 
directional concentration constant are properties of the 
dataset-kernel pair. The results for the protein sequence 
data (Figure [6]) indicate that the speedups increase with 
increasing reference set size, recording a speedup of over 
two orders of magnitude for a set of around 100000 se- 
quences, exhibiting the logarithmic scaling of FastMKS. 



7 Distributed data 

Despite the theoretical and empirical efficiency of 
FastMKS, the algorithm still requires the dataset (and 

^ftp://ftp. ncbi.nih.gov/refscq/release/complete/ 
*Whilc the hyperbolic tangent Iternel is not Mercer in general, 
our proposed algorithm works correctly with non-Mercer kernels if 
the kernel is positive definite when restricted to the dataset. 
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(a) Cosine and hyperbolic tangent kernels 

Figure 5: Speedups over linear scan for FastMKS with k 



F > 

(b) Linear and polynomial (degree 10) kernels 



1,2,5,10. 
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9524 16151 37789 94713 

Set size 

Figure 6: Speedups over linear scan for FastMKS with 
= 1, 2, 5, 10 with sets of protein sequences of increasing 
size to demonstrate the scaling of FastMKS. 

the tree) to fit completely in the memory of a single ma- 
chine. Large datascts generally need to be distributed 
across multiple nodes to load the whole data in memory. 
While a tree can be constructed on distributed data, this 
tree is inefficient for MKS because of the high internode 
communication in the branch-and-bound algorithm. In 
this section, we discuss a method to extend FastMKS 
to the distributed data setting and compare it to the ob- 
vious baseline. Since communication and data exchange 
are major bottlenecks in parallel and distributed systems, 
our proposed method and the baseline avoid these bot- 
tlenecks as well as fit into the map-reduce framework. 

Linear scan baseline. Indexing schemes are generally 
not conducive to distributed data and distributed linear 
scan is a possible alternative. Let the dataset of size n be 
evenly distributed over m nodes with each node contain- 
ing (n/m) objects and let there be a master node that 
communicates with each of these m nodes. For single 
query, the master sends the query to each of the m nodes 



in the map phase and the mappers on every node per- 
form a linear scan in parallel. The best match from each 
of these nodes is returned to the master node which can 
then return the best among those m returned matches in 
the reduce phase. Then the total time required to ob- 
tain the best match is (cim + C2^). It is important to 
note that each of the m nodes containing the actual data 
performs (n/m) kernel evaluations while the master node 
just sorts m kernel values; therefore, Ci ^ C2. The min- 
imum possible runtime achievable by linear scan in this 
distributed system is 



2^ciC2n ^ 0{\/n) with m = yj C2n/ci. 

Multiple tree indices. Let us consider this following 
scheme for indexing: for each of the m nodes containing 
{n/m) objects, build and save a cover tree on the data in 
each of the nodes. Similar to the linear scan case, for a 
single query, the master sends the query to each of the m 
nodes in the map phase and the mappers on every node 
perform the branch-and-bound search on the cover trees 
in parallel. The best matches from each of m trees are 
returned to the master node which then reports the best 
among them in the reduce phase. Since we have shown 
that a single query on a cover tree requires logarithmic 
time, let the time taken for this parallel search on multi- 
ple trees be (cim -|- c'2 log ^) , where C2 is the scaling con- 
stant independent of n for the cover tree on this dataset. 
In this scenario, the minimum possible achievable runtime 
is 

C2log(e n ci/c'2) ~ O(logn) with m = Cj/ci. 

Remark. The optimal values of m can only be used if 
there are enough number of nodes available and if {n/m) 
is small enough for the data in the node to fit in memory. 
However, searching over multiple tree indices is a simple 
yet efficient extension of our proposed method for dis- 
tributed data. In the best case, our method can achieve 
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0(log n) scaling while the distributed linear scan can only 
achieve 0{y^) scaling. More non-trivial extensions using 
recent technic^ucs [34 , 35j can achieve further efhciency in 
the distributed setting. 

8 Conclusion 

Our work appears to be the most comprehensive study 
of the general MKS problem to date, including the first 
rigorous characterization of its hardness and the first 
general-kernel method with provably logarithmic query 
time, sub-quadratic preprocessing time, and speed in 
practice. A tighter theoretical analysis without the 
bounded directional concentration constant assumption 
and an empirical investigation on more abstract objects 
like graphs and time series will be part of our future work. 
We wish to extend our search algorithm to more classes 
of similarity functions beyond Mercer kernels. 



Appendix 

A Cover tree construction 
in Hilbert space 

The details of the cover tree construction are provided 
in Algorithm [5] For a cover tree built on dataset S, 
each node is only associated with a single point p (z S. 
Therefore, because the only distance computations in- 
volve points in S, the explicit representation of the objects 
in H is not required (by the kernel trick). 

Given a Mercer kernel K, for a class of objects, the 
distances between points in H required for the tree con- 
struction in H can be evaluated using the distance metric 
die induced from the kernel. Three kernel evaluations 
are required to compute the distance; however, if the 
self-kernel values IC{x,x)\f x S are precomputed and 
saved, dic{x, y) can be evaluated with a single evaluation 
oilC{x,y). 



Algorithm 2 Construct f p. (Near, Far) , i) [26] 

if Near = then 

return {p, 0}. 
else 

{Self, Near} = 

Construct(p, Split(d(p, ■), 2'"\ {Near}), i - 1) 
add Self to Children(p) 
while Near / do 
pick q in Near 
{Child, Unused} = 

Construct(g, Split(d(g, •), 2'"\ {Near, Far}), 
i-l) 

add Child to Children(p) 
{New-near, New-far} = 

Split(d(p,-),2\ {Unused}) 
Far Far U New-far 
Near <— Near U New-near 
end while 
return {p. Far}. 
end if 

Split(d(p,.),r,{5i,52,...}) 

Near = [j^{q e S,: d{p,q) < r} 
Far = U J9 e 5*, : 2r > d{p, q) > r) 
\/ i,S^^ S^\ (Near U Far). 
return {Near, Far}. 



The algorithm calls the recursive function Con- 
struct (p, (Near, Far) , «) where p is a point, 
(Near, Far) are the point sets and i is the cur- 
rent level of the tree. This recursive function calls 
a subroutine Split ((i(p, •), r, {5'i, 5*2, .. .}) with d{p,-) 
representing the set of distances of the points in the point 
sets {5*1, S'2, . . .} to the point p and r is the splitting 
distance to form the Near and Far sets. 
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B Proof of Theorem 4.2 

We restate the theorem for completeness: 

Theorem B.l. Consider a kernel K, such that IC{x, x) — 
IV X e S*. Given a cover tree node rooted at an object p at 
level i in H and a (query) object q, the maximum kernel 
function value between q and any object in the set Sp is 
bounded as: 



(6) iciq,s;) 



< 



/C(g,p)(l 

+ 2'+V(l-/C(g,p)2)(l-22^), 
tflC{q,p)<l 

1.0, otherwise 



■)2i+l 



Proof. Since all the points are sitting on the surface of a 
hypersphere in H, JC{q,p) denotes the cosine of the angle 
made by ip{q) and (p{p) at the origin. If we first consider 
the case where q lies within the ball bounding cover tree 
node p at level i (that is, if dK{q,p) < 2*+^), it is clear 
that the maximum possible kernel evaluation should be 
1, because there could exist a point in whose angle to 
g is 0. We can easily modify this condition to an easier 
condition on IC{q,p), which is JC{q,p) < 1 — 2^*+^. 

Now, for the other case, let cos 9qp = IC{q,p) and p* = 
argmaxrgsi /C(q, r). Let Opp* be the angle between tpij)) 
and (p{p*) and 0qp' be the angle between (p{q) and ipip*) 
at the origin. Then 

(7) K:{q,p*) = cosOqp* < cos{{0qp - epp*}+). 

Now we know that d]c{p,p*) < 2'+^ and that dic{p,p*) ~ 
— 2 cos 9pp> . Hence cos^^pp* > 1 — 2^*+^, and 9pp. < 
I cos-i(l- 22^+1)1 and take = | cos-i(l - 22*+i)|. Com- 
bining this with equation ((T]), we get: 

/C(<z, S;) < cos{{9qp - 9pp,}+) < cosi{9qp - 9}+). 

Substituting the value of 9 above and simplifying gives us 
the statement of the theorem. □ 



C Proof of Theorem 4.3 

We will restate the theorem here: 

Theorem C.l. // T is a cover tree on S, then 
FastMKS{T,q) returns argmax^gs /C((7, r). 

Proof. Let p* — argmax^gs /C(<7, r) and k* — IC{q,p*). 
At the beginning of the iteration at level oo, p* is in 
consideration since p* & S and for p £ Cod, = 
S. At the end of the iteration at level i, all points 
p e S such IC{q,p) > /C(g, - 2^y^lC{q, q) are 

still in consideration. Since n* > IC{q,Ri-i), p* is 
still in consideration. Either p* e Ri-i or p* is the 



grandchild of some point p E Ri-i- Hence, by the- 
orem 4.1, /C((7,i?,_i) > K* - 2'^yiC{q,q) V i. Now 

limi^_oo/C(g,i?,_i) > lim,,^_oo - 2''yjK.{q, q)^ = k* 

(assuming lC{q,q) < oo). Hence JC{q,R-oc) = and 
FastMKS(r, q) returns argmaxre_R_^ JC{q, r) ~ p* . □ 



D Rank-approximate 
max-kernel search 

The rank-approximation of the max-kernel search is de- 
fined as follows: for a given set 5 of n objects (the refer- 
ence set), a (Mercer) kernel function /C(-, •), and a query 
q, find the object p € S such that 

\{r e S: JC{q,r) > JC{q,p)}\ < r. 

As mentioned in Ram et.al, 2009 [19], the idea is to draw 
enough samples S' from the tree such that 

Pr (|{r e S: JC{q, r) > JC{q, S')}\ <t)>1-5. 

Simplifying the formulation presented in [TH], the proba- 
bility of always missing the top t values for a given query 
q after k samples with replacement is given by (1 — . 
If we want a (1 — i5) success rate of sampling, then we 
want k to be such that 



><5, 



giving k 



log S 



Given that k samples (as defined 



log(l-^) 

above) is to be made from the tree, the stratified sampling 
on a tree is presented in Algorithm [3l We assume that 
at each node of the tree, we have access to the number 
of points in the subtree S^. rooted at node r at level i for 
every node in the tree. This algorithm returns a r-rank 
approximate solution to the max-kernel operation with 
probability (1 — 5). 

Algorithm 3 RAFastMKS fTree T, query q, rank 
error r, failure probability 5) 

Initialize Rao = Coo // the root of the tree 

Set k = : — N // the number of samples 

// tree descend 



log 5 
log(l-^) 

for j = oo to — oo do 

R = {Children{r) : r £ Ri} 
R! = {r£R: K.{q,r) > K{q,R) - T ^]C{^} 
Ri-i = {r (z R' : \Sr \ > j} // approximate by sampling 

R—ao = R—oo U {-R' \ Ri—l}. 

end for 

return arg max K{q,r) // the leaf nodes. 
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E Implementation details 

We use the following performance-improving optimiza- 
tions on the cover tree: 

(1) Instead of the upperbound of 2'+^ for the distance of 
a cover tree node p at level i to its furthest descen- 
dant, the actual distance to the furthest descendant 
is cached at tree construction time and used at search 
time, 

(2) The distance to the parent is cached to avoid evalu- 
ating JC for the child nodes where an improvement is 
impossible, 

(3) Instead of the base 2, an experimentally verified base 
of 1.3 is used for best results [26] . 

(4) IC{x,x)\fx e 5 is precomputed and stored for future 
use to reduce the number of kernel evaluations re- 
quired for a single distance computation d]c{x,y) to 
one (just IC{x,y) since /C(x,x) and IC{y,y) are pre- 
computed), 

(5) We use the tighter bound in Theorem 4.2 for normal- 
ized kernels. 
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